This commit adds extensive benchmarking coverage for matrix operations as part of Phase 1 of the performance improvement plan.

Changes:
- Add Matrix.fs benchmark file with 14 comprehensive benchmarks
- Benchmark element-wise operations (add, subtract, multiply, divide)
- Benchmark scalar operations (add, multiply)
- Benchmark matrix multiplication (matmul)
- Benchmark matrix-vector operations (both directions)
- Benchmark transpose operation
- Benchmark row/column access patterns
- Benchmark broadcast operations (addRowVector, addColVector)
- Test with sizes: 10x10, 50x50, 100x100

Benchmarks use BenchmarkDotNet with MemoryDiagnoser to track allocations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This was referenced Oct 11, 2025:
- Daily Perf Improver - Add benchmarks for matrix multiplication (was: Adaptive blocking for mmul) #22 (Closed)

github-actions bot added a commit that referenced this pull request on Oct 12, 2025:
This commit significantly improves the performance of row vector × matrix multiplication by reorganizing the computation to exploit row-major storage and SIMD acceleration.

## Key Changes

- Rewrote `Matrix.multiplyRowVector` to use a weighted sum of matrix rows
- Original: column-wise accumulation with strided memory access
- Optimized: row-wise accumulation with contiguous memory and SIMD

## Performance Improvements

Compared to baseline (from PR #20):

| Size    | Before    | After     | Improvement  |
|---------|-----------|-----------|--------------|
| 10×10   | 84.3 ns   | 55.2 ns   | 34.5% faster |
| 50×50   | 1,958 ns  | 622.6 ns  | 68.2% faster |
| 100×100 | 9,208 ns  | 1,905 ns  | 79.3% faster |

The optimization achieves a 3.5-4.8× speedup for larger matrices by:

1. Eliminating strided column access patterns
2. Enabling SIMD vectorization on contiguous row data
3. Broadcasting vector weights efficiently across SIMD lanes
4. Skipping zero weights to reduce unnecessary computation

## Implementation Details

The new implementation computes:

result = v[0]*row0 + v[1]*row1 + ... + v[n-1]*row(n-1)

This approach:

- Accesses matrix rows contiguously (cache-friendly)
- Broadcasts each weight v[i] to all SIMD lanes
- Accumulates weighted rows directly into the result vector
- Falls back to the original scalar implementation for small matrices

## Testing

- All 132 existing tests pass
- Benchmark infrastructure added (Matrix.fs benchmarks)
- Memory allocations unchanged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
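The weighted-row-sum reorganization described in the commit can be sketched in plain Python. This is an illustrative model of the access pattern only, not the FsMath F# implementation: the function name is hypothetical, and the SIMD broadcast is reduced to a scalar inner loop.

```python
def multiply_row_vector(v, m):
    """Compute result[j] = sum_i v[i] * m[i][j] by accumulating weighted
    rows, so every memory access walks a row contiguously instead of
    striding down a column."""
    rows, cols = len(m), len(m[0])
    result = [0.0] * cols
    for i in range(rows):
        w = v[i]
        if w == 0.0:           # skip zero weights (optimization 4 above)
            continue
        row = m[i]             # contiguous row access (cache-friendly)
        for j in range(cols):
            result[j] += w * row[j]  # the F# version vectorizes this loop,
                                     # broadcasting w across SIMD lanes
    return result
```

Per the commit, the real implementation broadcasts each weight `v[i]` across SIMD lanes and accumulates whole vector-width chunks of the row per step; the Python version only models the row-wise traversal that makes that vectorization possible.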
Summary
This PR adds comprehensive benchmarking coverage for matrix operations as part of Phase 1 (Quick Wins) of the performance improvement plan. This establishes baseline performance metrics for all core matrix operations.
Performance Goal
Goal Selected: Add comprehensive matrix operation benchmarks (Phase 1, Priority: HIGH)
Rationale: The research plan identified that while vector operations had benchmarks, matrix operations had no benchmarking coverage. This PR fills that critical gap by adding 14 comprehensive benchmarks spanning element-wise, scalar, matrix-multiplication, matrix-vector, structure, access-pattern, and broadcast operations.
Changes Made
New Benchmarks Added
All benchmarks test three matrix sizes (10x10, 50x50, 100x100) and use `MemoryDiagnoser` to track allocations.

Element-wise Operations:
1. `ElementWiseAdd` - SIMD-accelerated element-wise addition
2. `ElementWiseSubtract` - SIMD-accelerated element-wise subtraction
3. `ElementWiseMultiply` - SIMD-accelerated Hadamard product
4. `ElementWiseDivide` - SIMD-accelerated element-wise division

Scalar Operations:
5. `ScalarAdd` - Add scalar to all matrix elements
6. `ScalarMultiply` - Multiply all matrix elements by scalar

Matrix Multiplication:
7. `MatrixMultiply` - Standard matrix-matrix multiplication (matmul)

Matrix-Vector Operations:
8. `MatrixVectorMultiply` - Matrix × vector (SIMD-optimized)
9. `VectorMatrixMultiply` - Row vector × matrix (SIMD-optimized)

Structure Operations:
10. `Transpose` - Block-based transpose (16x16 blocks)

Access Patterns:
11. `GetRow` - Extract a single row (contiguous memory)
12. `GetCol` - Extract a single column (strided access)

Broadcast Operations:
13. `AddRowVector` - Add row vector to all matrix rows (SIMD)
14. `AddColVector` - Add column vector to all matrix columns (SIMD)

Files Modified
- `benchmarks/FsMath.Benchmarks/Matrix.fs` - New benchmark class
- `benchmarks/FsMath.Benchmarks/FsMath.Benchmarks.fsproj` - Added Matrix.fs to compilation
- `benchmarks/FsMath.Benchmarks/Program.fs` - Registered MatrixBenchmarks class

Approach
Benchmarks were run with BenchmarkDotNet's `--job short` configuration to keep total run time manageable while still producing stable baseline measurements.

Performance Measurements
Test Environment
Results Summary by Operation Type
Element-wise Operations (10x10)
All element-wise operations show excellent SIMD performance with ~70ns latency:
Scalar Operations (10x10)
Scalar operations are slightly faster than element-wise:
Matrix Multiplication Scaling
Shows expected O(n³) scaling:
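The O(n³) cost comes from the three nested loops of standard matrix multiplication, so doubling the size multiplies the work by eight. A minimal Python sketch (illustrative only, not the FsMath implementation):

```python
def matmul(a, b):
    """Naive matrix multiply: n*k*m multiply-adds, hence O(n^3) for
    square matrices. The i-p-j loop order keeps access to b's rows
    contiguous, mirroring row-major-friendly traversal."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a_ip = a[i][p]          # hoist a[i][p] out of the inner loop
            for j in range(m):
                out[i][j] += a_ip * b[p][j]
    return out
```

This is only a scaling model; the benchmarked `MatrixMultiply` is the library's SIMD-accelerated routine.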
Matrix-Vector Operations (100x100)
Access Pattern Comparison (100x100)
Detailed Results Table
Key Observations
Performance Bottlenecks Identified
From these benchmarks, we can identify Phase 2 optimization opportunities:
Replicating the Performance Measurements
To replicate these benchmarks:
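A typical invocation would look like the following (the project path comes from the Files Modified section; the exact `--filter` pattern is an assumption, though `--job short` matches the configuration described above):

```shell
# Run the matrix benchmarks in Release mode with BenchmarkDotNet's short job
dotnet run -c Release --project benchmarks/FsMath.Benchmarks -- \
  --filter '*MatrixBenchmarks*' --job short
```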
Results will be saved to `BenchmarkDotNet.Artifacts/results/` in multiple formats (GitHub MD, HTML, CSV).

Testing
✅ All benchmarks compile successfully
✅ All 14 matrix benchmarks × 3 sizes = 42 benchmarks discovered
✅ All benchmarks execute without errors
✅ Existing tests still pass (132 tests)
✅ No performance report files included in commit
Next Steps
This PR establishes comprehensive baseline measurements for matrix operations. Based on these measurements, future work from the performance plan includes:
Phase 1 (remaining):
Phase 2 (algorithmic improvements):
Phase 3 (advanced optimizations):
Related Issues/Discussions
Commands Used
🤖 Generated with Claude Code